Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Improve multimodal processors - rely less on kwargs #28711

Draft
wants to merge 36 commits into
base: main
Choose a base branch
from

Conversation

molbap
Copy link
Contributor

@molbap molbap commented Jan 25, 2024

What does this PR do?

This PR aims at a better control on the logic flow through Processor classes, in particular those leveraging ImageProcessor with a Tokenizer. Linked with #27768.

ImageProcessors compared to Nougat (as a reference point) have different signatures in their preprocess. One can list them here

TvltImageProcessor:
videos, patch_size, crop_size, do_center_crop, is_mixed, num_frames

IdeficsImageProcessor:
transform, image_num_channels, image_size

ViTImageProcessor:
No difference in args

Mask2FormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, reduce_labels, instance_id_to_semantic_id

MaskFormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, do_reduce_labels, instance_id_to_semantic_id

YolosImageProcessor:
format, return_segmentation_masks, annotations, masks_path

MobileNetV1ImageProcessor:
do_center_crop, crop_size

DeiTImageProcessor:
do_center_crop, crop_size

EfficientNetImageProcessor:
include_top, do_center_crop, rescale_offset, crop_size

BeitImageProcessor:
do_reduce_labels, do_center_crop, segmentation_maps, crop_size

MobileViTImageProcessor:
do_flip_channel_order, do_center_crop, segmentation_maps, crop_size

PerceiverImageProcessor:
do_center_crop, crop_size

DeformableDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

EfficientFormerImageProcessor:
do_center_crop, crop_size

SegformerImageProcessor:
do_reduce_labels, segmentation_maps

LayoutLMv2ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

BridgeTowerImageProcessor:
do_center_crop, size_divisor

SamImageProcessor:
segmentation_maps, pad_size, do_convert_rgb, mask_pad_size, mask_size

BlipImageProcessor:
do_convert_rgb

Owlv2ImageProcessor:
No difference in args

LayoutLMv3ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

DetaImageProcessor:
format, return_segmentation_masks, annotations, masks_path

BitImageProcessor:
do_center_crop, do_convert_rgb, crop_size

ViTHybridImageProcessor:
do_center_crop, do_convert_rgb, crop_size

FuyuImageProcessor:
patch_size, padding_mode, padding_value

PvtImageProcessor:
No difference in args

Pix2StructImageProcessor:
max_patches, header_text, do_convert_rgb, patch_size

VitMatteImageProcessor:
trimaps, size_divisibility

VideoMAEImageProcessor:
videos, do_center_crop, crop_size

MobileNetV2ImageProcessor:
do_center_crop, crop_size

OneFormerImageProcessor:
segmentation_maps, ignore_index, task_inputs, do_reduce_labels, instance_id_to_semantic_id

FlavaImageProcessor:
crop_size, codebook_crop_size, codebook_rescale_factor, mask_group_max_patches, mask_group_min_patches, mask_group_max_aspect_ratio, codebook_image_mean, codebook_do_resize, return_image_mask, input_size_patches, codebook_do_center_crop, codebook_resample, mask_group_min_aspect_ratio, codebook_do_normalize, codebook_do_map_pixels, return_codebook_pixels, codebook_image_std, do_center_crop, codebook_size, codebook_do_rescale, total_mask_patches

DonutImageProcessor:
random_padding

TvpImageProcessor:
videos, crop_size, constant_values, do_flip_channel_order, do_center_crop, pad_size, pad_mode

GLPNImageProcessor:
size_divisor

PoolFormerImageProcessor:
crop_pct, do_center_crop, crop_size

CLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

DPTImageProcessor:
ensure_multiple_of, keep_aspect_ratio, size_divisor

ViltImageProcessor:
size_divisor

Swin2SRImageProcessor:
pad_size

ImageGPTImageProcessor:
clusters, do_color_quantize

SiglipImageProcessor:
No difference in args

VivitImageProcessor:
videos, do_center_crop, offset, crop_size

ConvNextImageProcessor:
crop_pct

OwlViTImageProcessor:
do_center_crop, crop_size

ChineseCLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

LevitImageProcessor:
do_center_crop, crop_size

ConditionalDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

DetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

This helps standardize a bit in the first place, and then, will allow uniformizing Processors.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Models:

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - looking good!

Let me know when you want another review 🤗

@@ -39,7 +39,7 @@ def __init__(
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
do_center_crop: bool = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for changing the default here? I think BridgeTowerImageProcessor defaults to this being True, so would have this value, even if not passed here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, was conflicted on this... this is the current version of preprocess for bridgetower in main, missing a new variable declaration:

do_center_crop if do_center_crop is not None else self.do_center_crop

In that case, the do_center_crop that was used in preprocess was the preprocess default, i.e. None instead of whatever was in the __init__, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amyeroberts on that, bridgetower did default do_center_cropto being True, but preprocess was not capturing it (it was a bug). The get_expected_values() method does not include mentions on center_crop and will crate expected uncropped values. From that

  1. Either change ...expected_values... getter to something that includes cropping in logic
  2. change default of tester to match previous behaviour

Comment on lines -78 to -80
if len(args) > 0:
images = args[0]
args = args[1:]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very nice to see this go :)

)
# add pixel_values + pixel_mask
print(size)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(size)

Comment on lines 470 to 492
valid_processor_keys = {
"images",
"do_resize",
"size",
"size_divisor",
"resample",
"do_rescale",
"rescale_factor",
"do_normalize",
"image_mean",
"image_std",
"do_pad",
"do_center_crop",
"return_tensors",
"data_format",
"input_data_format",
"pad_and_return_pixel_mask",
}

unused_keys = set(kwargs.keys()) - valid_processor_keys
if unused_keys:
unused_key_str = ", ".join(unused_keys)
logger.info(f"Unused or unrecognized configuration parameters: {unused_key_str}.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments:

  • What's the reason for this being added for some image processors but not others e.g. donut also accepts kwargs?
  • Could we abstract out this check to something similar with Abstract image processor arg checks. #28843 using either inspect or an explicit class attribute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points!

  • no reason at all - all for adding a similar logic to all!
  • inspect is a bit slow, right? But yes, also +1 to abstracting this away, I'll move it to another PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amyeroberts was thinking about decorators: wdyt about having this as a wrapper/decorator in image_utils or elsewhere in utils?

        valid_processor_keys = inspect.getfullargspec(self.preprocess)[0]

        unused_keys = set(kwargs.keys()) - valid_processor_keys
        if unused_keys:
            unused_key_str = ", ".join(unused_keys)
            logger.info(f"Unused or unrecognized configuration parameters: {unused_key_str}.")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Perhaps instead of inspecting, we could have a class attribute with the valid_processor_keys? e.g. something like this:

class FooImageProcessor:
    _valid_processor_keys = ['bar', 'baz']
    ...

    def preprocess(..., kwargs):
        validate_kwargs(self._valid_processor_keys, kwargs)

Having a class attribute like this is something I think I'm going to end up with in the #28847 design

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good yes! self-contained classes seem worth losing the decorator

**kwargs,
) -> None:
if "pad_and_return_pixel_mask" in kwargs:
do_pad = kwargs.pop("pad_and_return_pixel_mask")
if pad_and_return_pixel_mask:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's set the default value for size in the init to avoid having mutables as default arguments

Suggested change
if pad_and_return_pixel_mask:
size = {"shortest_edge": 288} if size is None else size
if pad_and_return_pixel_mask:


if text is not None:
self.current_processor = self.tokenizer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current processor behaviour is deprecated, so we don't need to set it here. In fact, we should probably create a current_processor property which shows a deprecation message when used

Suggested change
self.current_processor = self.tokenizer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's good to know! Seen it in another instance I think, I'll drop it in that case and add the message

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's definitely still around in places. For context, we used to have a behaviour when the current_processor was selected through a context manager. The context manager behaviour was removed, but there's still remnants of this, even though current_processor normally has no effect.


if text is not None:
self.current_processor = self.tokenizer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about current processors

Comment on lines 200 to 201
do_center_crop: Optional[bool] = True,
do_pad: Optional[bool] = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These shouldn't be optional in the init

Suggested change
do_center_crop: Optional[bool] = True,
do_pad: Optional[bool] = True,
do_center_crop: bool = True,
do_pad: bool = True,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely, fixed in #28843 which should be merged before that one

@huggingface huggingface deleted a comment from github-actions bot Mar 22, 2024
@huggingface huggingface deleted a comment from github-actions bot Apr 16, 2024
@github-actions github-actions bot closed this May 28, 2024
@huggingface huggingface deleted a comment from github-actions bot May 28, 2024
@molbap molbap reopened this May 28, 2024
@huggingface huggingface deleted a comment from github-actions bot Jun 24, 2024
@huggingface huggingface deleted a comment from github-actions bot Jul 24, 2024
@huggingface huggingface deleted a comment from github-actions bot Aug 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants